Script and Language Identification in Degraded and Distorted Document Images
نویسندگان
چکیده
This paper reports a statistical identification technique that differentiates scripts and languages in degraded and distorted document images. We identify scripts and languages through document vectorization, which transforms each document image into an electronic document vector that characterizes the shape and frequency of the contained character and word images. We first identify scripts based on the density and distribution of vertical runs between character strokes and a vertical scan line. Latin-based languages are then differentiated using a set of word shape codes constructed using horizontal word runs and character extremum points. Experimental results show that our method is tolerant to noise, document degradation, and slight document skew and attains an average identification rate over 95%.
منابع مشابه
Language Identification in Degraded and Distorted Document Images
This paper presents a language identification technique that differentiates Latin-based languages in degraded and distorted document images. Different from the reported methods that transform word images through a character shape coding process, our method directly captures word shapes with the local extremum points and the horizontal intersection numbers, which are both tolerant of noise, char...
متن کاملDegraded Script Identification for Indian Language- A Survey
The working module of any Optical character Recognition system almost depends upon printing and paper of the input document image. A number of OCR techniques are available and claim correctly identified accuracy in printed document image in Indian and foreign script. A few report have been found on the recognition of the degraded Indian language document. The degradation in any scanned printed ...
متن کاملScript and Language Identification for Document Images and Scene Texts
In recent times, there have been an increase in Optical Character Recognition (OCR) solutions for recognizing the text from scanned document images and scene-texts taken with the mobile devices. Many of these solutions works very good for individual script or language. But in multilingual environment such as in India, where a document image or scene-images may contain more than one language, th...
متن کاملScript Identification for Document Image Retrieval: A Survey
In recent years there are many multimedia documents captured and stored with the advances in computer technology and hence the demand for recognizing and retrieval of such documents has increased tremendously .In such environment the large volume of data and variety of scripts make manual identification unworkable. In such cases the ability to automatically determine the script ,and further the...
متن کاملDensity Based Script Identification of a Multilingual Document Image
Automatic Pattern Recognition field has witnessed enormous growth in the past few decades. Being an essential element of Pattern Recognition, Document Image Analysis is the procedure of analyzing a document image with the intention of working out the contents so that they can be manipulated as per the requirements at various levels. It involves various procedures like document classification, o...
متن کامل